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Abstract. With the wealth of high-throughput sequencing data generated by recent large-scale consor¬ 
tia, predictive gene expression modelling has become an important tool for integrative analysis of tran- 
scriptomic and epigenetic data. However, sequencing data-sets are characteristically large, and previously 
modelling frameworks are typically inefficient and unable to leverage multi-core or distributed process¬ 
ing architectures. In this study, we detail an efficient and parallelised MapReduce implementation of gene 
expression modelling. We leverage the computational efficiency of this framework to provide an integra¬ 
tive analysis of over fifty histone modification data-sets across a variety of cancerous and non-cancerous 
cell-lines. Our results demonstrate that the genome-wide relationships between histone modifications and 
mRNA transcription are lineage, tissue and karyotype-invariant, and that models trained on matched epi- 
genetic/transcriptomic data from non-cancerous cell-lines are able to predict cancerous expression with 
equivalent genome-wide fidelity. 


1 Introduction 

Computational frameworks for modelling gene expression as a function of gene-localised epigenetic features 
are becoming increasingly common in life sciences research. Previous studies by our lab If™ and others ma 
have leveraged the statistical power of modelling genes as observations of regulatory activity (versus variables 
in network-based analyses 0) to gain new insight into the function and interactions of transcription factors, 
histone modifications and DNA methylation. Recent applications include: inference of transcription factor roles 
from their respective binding motifs Q; identification of regulatory elements responsible for differential ex¬ 
pression patterns |[8|; exploring the relationship between gene expression and chromatin organisation El; and 
comparative analysis of the transcriptome across distant species a. 

Despite the wealth of high-throughput sequencing data made available by recent large-scale consortia (e.g. 
ENCODE), previous predictive modelling studies have focused on a very small number of cell-lines (typi¬ 
cally l-to-3 EE) despite the obvious benefits of broader, integrative analyses. We attribute this to the size 
of sequencing data, the computational complexity of naive model implementations and widespread inability 
of published frameworks to leverage multi-core and/or distributed architectures. In this study, we apply the 
MapReduce programming paradigm ifTOll with domain-specific complexity optimisations to provide efficient, 
parallelised gene expression modelling. 

A recent study by Jiang et al. has suggested that RNA- (transcriptomic) and ChIP-seq (epigenetic) data 
generated in the same condition {i.e. the same cell-line) introduces statistical bias and that specialised methods 
are necessary for accurately modeling the expression of cancer cells lH . This study investigates both of these 
concerns, exploiting the computational efficiency of our MapReduce implementation and conducting an inte¬ 
grative analysis of six histone modifications across eight dissimilar ENCODE cell-lines. Eirstly, we extend our 
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predictive modelling framework to include L^-regularisation, which is specifically designed to prevent over¬ 
fitting to experimental noise rather than meaningful biological relationships. We then quantify the extent of 
condition-specific bias by training and testing models on all 64 directed, pairwise combinations of cell-lines. 

2 Materials and Methods 

2.1 ENCODE cell-line data 

Matched mRNA transcript abundance (RNA-seq) and histone modification (ChIP-seq) data were downloaded 
from ENCODE ifT^ for the eight cell-lines summarised in Table These dissimilar cell-lines are the full set 
for which data is available for the histone modifications listed in Table The remaining histone modifica¬ 
tions available from ENCODE are unsuitable for this study as they assert their functional role in non-promoter 
regions {e.g. H3K36me3 in the 3'-UTR). The MapReduce implementation of gene expression modelling pre¬ 
sented in this study could be trivially extended to model more cell-lines if the data were made available. 


Table 1. All ENCODE cell-lines for which matched ChIP-seq data was available for the full set of histone modifications 
considered in this study (listed in Table|^. 


Cell-line 

Tier 

Description 

Lineage 

Tissue 

Karyotype 

A549 

2 

Alveolar carcinoma 

Endoderm 

Epithelium 

Cancer 

GM12878 

1 

B-lymphocyte 

Mesoderm 

Blood 

Normal 

Hl-hESC 

1 

Embryonic stem cells 

Inner cell mass 

Embryonic stem cell 

Normal 

HeLa-S3 

2 

Cervical carcinoma 

Ectoderm 

Cervix 

Cancer 

HepG2 

2 

Hepatocellular carcinoma 

Endoderm 

Liver 

Cancer 

HUVEC 

2 

Umbilical vein endothelial cells 

Mesoderm 

Blood vessel 

Normal 

K562 

1 

Leukemia 

Mesoderm 

Blood 

Cancer 

NHEK 

3 

Epidermal keratinocytes 

Ectoderm 

Skin 

Normal 


Table 2. All histone modifications considered in this study. The remaining histone modifications available from ENCODE 
are unsuitable for this study as they assert their functional role in non-promoter regions (e.g. H3K36me3 in the 3'-UTR). 


Histone modification 

Regulatory role 

Chromatin localisation 

H2A.Z 

Bivalency 

Euchromatin 

H3K4me3 

Activator/Bivalency 

Euchromatin 

H3K9ac 

Activator 

Euchromatin 

H3K9me3 

Repressor 

Constitutive heterochromatin 

H3K27ac 

Activator 

Euchromatin 

H3K27me3 

Repressor/Bivalency 

Facultative heterochromatin 


2.2 MapReduce 

MapReduce is programming paradigm which adapts the map-reduce functional programming construct for dis¬ 
tributed and fault-tolerant data processing on commodity hardware. Eirst developed by Google ifTOll . MapRe¬ 
duce is now widely used by companies such as Amazon, Eacebook, Google and Yahoo! for parallelised pro¬ 
cessing of data on terabyte and petabyte scales. Although many of the advanced features of MapReduce are less 
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relevant in the life sciences domain 113 {e.g. optimisation of network communications, as a single sequencing 
data-set can often fit in main memory), the ability to seamlessly interleave sequential and parallel processing 
can be leveraged to reduce the processing time of gene expression modelling linearly in the number of available 
CPU cores. 

A program implemented using the MapReduce paradigm consists of a sequence, (/ri, pi,/j, 2 ,/ 02 , • ■ ■ 
of mappers (p^) and reducers {pr) operating over (key; value) pairs. Formally, a MapReduce program executes 
the following steps on input Uq until the final reducer (pji) halts llT4l : 


Algorithm 1 MapReduce ((pi, pi, p 2 , P 2 , ■ ■ ■, Pr, Pr),U o) 
for r = 1, 2,..., i? do 
Ur ■«- MAP(Ur-l) 

Vr SHUFFLE([/)) 

Ur REDUCE(Ur) 

return Ur 

function MAP(?7r-i) 

^ 0 

for (k; v) G Ur-i do 

[/; ^ U^U pr{{k;v)) 

return U) 

function Shuffle(U)) 

Vr^% 

for each unique key A: G [/) do 

Vk,r {k-, {V1,V2, . . . , Vm}) ■■ {k,Vm) G U( 

Vr^VrU Vk,r 

return Vr 

function Reduce(14) 

Ur 

for each 14,^ G 14 do 

Ur ■(— UrU pr {{k\ Vk,r)) 

return Ur 


The computational benefit of MapReduce follows from its inherent parallelisability, as many instances of pr 
are able to process their (key; value) simultaneously (likewise with p^, although all instances of Pr-i must halt 
before any pr can commence). The following sections detail mapper and reducer implementations for various 
stages of the predictive gene expression modelling pipeline described in our previous studies 01I2I3L 

2.3 Quantifying transcriptional regulatory interactions 

The strength of association between a gene, m G (1,2,..., M), and epigenetic feature, n G (1,2,..., N), can 
be calculated from a ChlP-seq data-set; 


a^m.n = ^ (j){r,m), 

r^Rn 

\d{r ,m)\<d* 

where is the set of ChlP-seq reads for n, d{r, m) is the distance (bp) separating read r from the TSS of 
m, and (p maps a gene-read pair to their strength of association. The maximum bin-width, d*, is typically set 
to 2000 to approximate the average width of ChlP-seq binding regions. Different implementations of p are 
used for histone modifications (constrained sum-of-tags) versus transcription factors (exponentially decaying 
affinity) due to their dissimilar ChlP-seq binding profiles fJl: 
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for histone modifications 
for transcription factors 

where hyperparameter do controls the strength of exponential decay for quantifying transcription factor inter¬ 
actions and is typically set to do = 5000. The resultant matrix of gene-level epigenetic scores, X G , is 

then log (or arsinh)-transformed and quantile-normalised for use in a regression model. 

Given ChIP-seq data for epigenetic feature n represented in UCSC wiggle (.WIG) format: 


m) = 


1 

exp (- 


variableStep 

chromStartA 

chromStartB 

. .. etc . .. 


chrom= chrW 
dataValueA 
dataValueB 

... etc ... 


[span=windowSize] 


each column X* „ G TZ^ : X* „ = col„ (X) of the epigenetic score matrix can be efficiently calculated using 
MapReduce, as demonstrated below. Similar formulations can be derived for other ChIP-seq file formats. 


Algorithm 2 MREpigeneticS 

.cores(X*,„) 


procedure Mapper p ((X*_„ 

{locus', value))) 

for each gene m do 



if |d(ZocMS, m)| < d* 

then 


TLMn{{xm,n', value x fp{locus, 

m))) 

procedure Reducer p {{xm. 

n; {V\,V2, ■ ■ ■ 

,Vk})) 

EMn(^(^X,n,n-, Y.iLl Vi^ ^ 

) 



2.4 Linear regression with least squares fitting 

Suppose X e TZ^^^ is a matrix of gene-level epigenetic scores, where M is the number of genes and N is 
the number of epigenetic variables (M ^ N). It is commonplace to model the relationship between X and a 
vector of gene-expression values, Y G TZ^, in the following form: 


y = X/3 + e, 

where /3 parameterises the linear relationship between gene expression and epigenetics, and £ are the gene- 
specific errors. Such models can be fitted using ordinary least squares: 


/3 = argmin (||y — X/?|p) 
I3en" 

= (xTx)”^x^r, 

yielding the following model-based predictions of gene expression, Y: 

F = X/3. 
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Given two matrices, A e and B € the product C G TZ^^^ : C = A x B can be 

formulated as follows: 


Cj /j; G C : /j; — ^ X 

where: 


^i,* G TZ^ : Ai^^, — coli (A^) , 

G TZ^ ■■ = colfe (B). 

This formulation of matrix multiplication can be implemented using the MapReduce paradigm, as demonstrated 
below. 


Algorithm 3 MRMultiply(A, B) 

procedure Mapper ha ({A; aij)) 
for fc = 1, 2,..., Z do 

EMIT((Ci^fc5 Cii,j )) 

procedure Mapper hb ({B; foj,fe)) 

for i = 1, 2,..., A do 

EMIT((ci,fc;6j,fe)) 

procedure Reducer p {{dy, {Ai,*, B*,*})) 

EMiT((ci,fe; X 


Our implementation of linear regression with least squares fitting involves decomposing (3 into the product 
A-^B, where A G : A = MRMultiply(X^, X) and B G : B = MRMultiply(X^, F). 

The product A~^B is calculated using standard matrix multiplication due to the unnecessary overhead of 
MapReduce for small matrices. 


2.5 Regularised least squares regression 

Regularisation is a common method of overcoming the issue of over-fitting regression-based models to ex¬ 
perimental noise rather than meaningful biological relationships. Regularisation involves penalising the fitted 
parameters, /?, by an empirically-tuned hyperparameter. A: 

/3 = argmin(||y - X/?|p -f A||/3|p) . 

Presuming 11 • 11 is the (Euclidean) norm, our MapReduce implementation can be trivially extended to 
support regularisation (implementing ridge regression). Specifically, given: 


y 


X 

0 

, X = 



where /at is the N x N identity matrix, it follows that: 
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/3 = argmin (||F — X/?|p ) 

/3g7?,« ' 

= (x^x) ^ X^Y 
= {X^X + \I„y^X^Y. 

This implementation yields the same asymptotic time complexity as ordinary least squares regression. More¬ 
over, the existence theorem for general ridge regression demonstrates that it is always possible to tune A (e.g. 
using cross-validation) to reduce the mean square error of model predictions 0151161 . This is particularly impor¬ 
tant when introducing a large number of epigenetic variables into a predictive model; e.g. a systematic analysis 
of the roles of dozens of transcription factors from their ChIP-seq binding profiles. In this study, A is assigned 
the largest possible value such that the mean 10-fold cross-validated error is within 1 standard error of the 
minimum (solved iteratively). 

Unlike the norm, the norm is often used to enforce sparsity in f} under the assumption that most 
variables in X are practically unrelated to Y. This is less relevant in the context of gene expression modelling 
due to the well-established functional importance of epigenetic regulators for which ChIP-seq data is widely 
available. Moreover, the norm is not differentiable and thus not amenable to a closed-form MapReduce 
solution, although the parallelisation of iterative solutions is discussed elsewhere im. 


3 Results 

3.1 Histone modifications are predictive of gene expression in both cancerous and normal cell-lines 

L^-regularised linear regression models of genome-wide mRNA transcript abundance were constructed as func¬ 
tions of the following histone modifications: H2A.Z, H3K4me3, H3K9ac, H3K9me3, H3K27ac and H3K27me3. 
For each model, the regularisation parameter. A, was fitted using 10-fold cross-validation. The adj. E? perfor¬ 
mance of each model is presented in Fig. along with a density plot of predicted (Y) versus measured (Y) 
transcript abundance. It is evident that histone modifications are accurate predictors of gene expression in both 
cancerous (top row, mean adj. = 0.608) and normal cell-lines (bottom row, mean adj. = 0.581), despite 
recent studies suggesting that specialised models are necessary to appropriately model cancerous cells 

Fig. [^presents the results of hierarchically clustering cell-lines by mRNA transcript abundance residuals 
(e = Y — Y). Interestingly, the three mesodermal derivatives GM12878, K562 and HUVEC form a distinct 
cluster. RNA-sequencing data for the least similar cell-line (A549) was generated at Cold Spring Harbor Labo¬ 
ratory whereas all other transcriptomic data was generated at the California Institute of Technology, suggesting 
that batch effects may be a contributing factor. It is also evident that the expression levels of many genes are 
consistently over- or under-estimated across all eight cell-lines. Taken together, these results indicate that gene- 
specific residuals are non-random but instead indicative of genes that are inherently difficult to model from 
histone modification data. The existence of genes with transcriptional activity apparently decoupled from the 
local epigenetic landscape has been explored in detail in our previous study El. 

3.2 The regulatory roles of histone modifications are cell-line invariant 

To assess the extent to which condition-specific bias influences the reported accuracy of gene expression pre¬ 
dictions, we trained and tested models on all 64 directed, pairwise combinations of cell-lines. The adj. R^ per¬ 
formance for these models are presented in Fig.j^a). These results demonstrate significant non-symmetry, with 
dissimilarity between columns (predictions) but not rows (training observations). This demonstrates that the 
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RNA-seq mRNA transcript abundance (Y) 


Fig. 1. Density plots of predicted (Y) versus measured (Y) mRNA transcript abundance abundance for cancerous (top row, 
mean adj. FI? = 0.608) and normal cell-lines (bottom row, mean adj. I? = 0.581). The adj. F? performance and A 
regularisation parameter (fitted using 10-fold cross validation) is reported for each cell-line. 


transcriptional regulatory roles of histone modifications are cell line invariant at a genome-wide level (within 
the constraints of a linear model); e.g. A549 and GM12878 expression can be accurately predicted by mod¬ 
els trained on any cell-line, despite their diversity in lineage, tissue and karyotype. These results are further 
supported by Fig.j^b), which demonstrates consistency in the fitted model parameters, /3, across all cell-lines. 

It is worth noting that that models trained and tested using data from a single cell-line (boldfaced along 
the diagonal of Fig. [^a)) only marginally outperform models trained on dissimilar cell-lines and, moreover, 
that these margins are significantly less than the inherent variation between columns. These findings suggest 
that, in the context of gene expression modelling, training and testing models on data generated under the same 
experimental conditions (i.e. the same cell-line) is not a significant source of statistical bias. 


3.3 MapReduce reduces the asymptotic time complexity of gene expression modelling 

For M genes and a ChIP-seq data-set containing R mapped reads, the asymptotic time complexity class of 
generating a column -Y* „ of X is 0{MR). By first preprocessing the list of gene TSS loci (invariant between 
epigenetic datasets) into a balanced binary search tree and observing that the vast majority of reads are within 
d* bp of exactly zero-or-one gene, our MapReduce implementation of calculating JY* „ yields the following 
complexity when distributed across P MapReduce nodes: 


MREpigeneticScores G 0 




which must be completed separately for each epigenetic feature, n G (1,2,..., N). 
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Fig. 2. Hierarchical clustering of cell-lines by mRNA transcript abundance residuals {e = Y — Y). The three mesodermal 
derivatives GM12878, K562 and HUVEC cluster together, suggesting that residuals are partially non-random and instead 
convey meaningful biological information. Consistently, it is evident that the expression levels of many genes are poorly 
predicted across all eight cell-lines, presumably capturing divergence from histone modification-mediated regulation (ex¬ 
plored in detail in our previous study 12). 


For X G and Y G , the asymptotic time complexity of ordinary least squares fitting (3 = 

/(X, Y) can also be derived: 

xr A=xrx B=XTy A-i A-iB 

Observing that R ^ M ^ N for gene expression modelling and by distributing the calculation of A and B 
across P MapReduce nodes, the overall complexity reduces to: 


MRExpressionModelling G 0 


NRlog{M)\ 

P ') 


thus this MapReduce implementation of gene expression modelling yields an optimal 0{P) improvement in 
asymptotic time complexity without the need to parallelise matrix inversion or transpose operations. 


4 Discussion 

Many previous predictive modelling studies have been limited in scope to 1-3 cell-lines due to the compu¬ 
tational expense of integrating and analysing large, high-throughput sequencing data-sets. In this study, we 
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Fig. 3. (a) Genome-wide accuracy of mRNA transcript abundance predictions (adj. R^) for models trained and tested on 
each pairwise combination of cell lines. These results are strikingly non-symmetric, with significant dissimilarity between 
columns (predictions) but not rows (training observations), (b) Distribution of each fitted model parameter. Pm, across all 
cell-lines considered in this study. 


introduced a MapReduce implementation of gene expression modelling that is able to obtain a full 0{P) im¬ 
provement in asymptotic time complexity when distributed across P CPUs (e.g. as part of multi-core PC or 
high-performance cluster). This implementation was subsequently applied to an integrative analysis of more 
than 50 epigenetic and matched transcriptomic data-sets across 8 dissimilar ENCODE cell-lines. 

Despite recent studies presenting specialised methods for modelling cancerous gene expression ifTTl . we 
find no evidence of variation in the statistical relationship between histone modifications and mRNA transcript 
abundance in normal-versus-cancerous cell-lines. Although our results demonstrate that some cell-lines are 
inherently more difficult to model than others, this trait appears to be more closely associated with the extent 
of cellular differentiation than carcinogenic state; e.g. models of hl-hESC embryonic stem cells perform 12% 
worse than terminally-differentiated GM12878 lymphoblasts. Although the NHEK (Normal Human Epidermal 
Keratinocytes) cell-line is both terminally-differentiated and exhibits the worst-performing models, this may be 
attributed to the phenotypic plasticity of keratinocytes between epithelial and mesenchymal states (necessary 
for wound healing). We therefore speculate that the predictability of a cell-line’s genome-wide expression levels 
from epigenetic data is proportional to its transcriptomic rigidity; i.e. cells with signal-induced phenotypic 
plasticity are less likely to exhibit a stable, predictive epigenome. 

Interestingly, hierarchical clustering of the 8 investigated cell-lines by mRNA transcript abundance residu¬ 
als (gene-level prediction errors) was able to group the closely-related, mesodermal-derivative cell-lines GM12878, 
K562 and HUVEC; again, carcinogenic state appeared to have little effect on the propensity of two cell-lines 
to cluster together. Taken together with the observation that many genes exhibited large and consistent residu¬ 
als across all cell-lines, these results suggest that gene-level residuals are non-random and, moreover, that the 
transcriptional activity of many genes are decoupled from their local epigenetic landscape IJl. 
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